Statistical inference, part III

Point and interval estimates

Eva Freyhult

NBIS, SciLifeLab

April 23, 2024

Point estimate

Unknown population parameters can be inferred from random samples drawn from the population of interest. The sample estimate is our best guess, a point estimate, of the population parameter.

The sample proportion and sample mean are unbiased estimates of the population proportion and population mean.

The expected value of an unbiased point estimate is the population parameter that it estimates.

The sample estimate is our best guess, but it will not be without error.
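Both points (unbiased on average, but with error in any single sample) can be illustrated with a small simulation. This is only a sketch: the population proportion `pi_true`, the sample size and the number of repetitions are hypothetical values chosen for illustration.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
pi_true <- 0.42      # hypothetical population proportion
n <- 100             # size of each random sample

# Draw 10000 random samples of size n and compute the sample proportion in each
p_hat <- rbinom(10000, size = n, prob = pi_true) / n

mean(p_hat)          # close to pi_true: the estimate is unbiased
sd(p_hat)            # but a single estimate typically misses by about this much
```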

Bias and precision

Figure 1: Bias and precision.

Interval estimates

To show the uncertainty, we can compute an interval estimate for a population parameter from the sample data, instead of just a point estimate.

An interval estimate is an interval of possible values that with high probability contains the true population parameter.

The width of the interval estimate can be determined from the sampling distribution.

Bootstrap interval

If the sampling distribution of the sample statistic of interest is unknown, a bootstrap interval can be computed instead.

Bootstrapping means using the data we have (our sample) and sampling repeatedly, with replacement, from this sample.

Put the entire sample in an urn and resample!

Bootstrap interval

Pollen example

If we are interested in how large a proportion of the Uppsala population is allergic to pollen, we can investigate this by studying a random sample. We randomly select 100 people in Uppsala and observe that 42 have a pollen allergy.

Based on this observation, our point estimate of the Uppsala population proportion \(\pi\) is \(\pi \approx p = 0.42\).

Sample from the urn with replacement to compute the bootstrap distribution.

Draw one object at a time, with replacement, 100 times and note the proportion allergic (black balls).

Repeat this many times to get a bootstrap distribution.

Using the bootstrap distribution the uncertainty of our estimate of \(\pi\) can be estimated.

The 95% bootstrap interval is [0.32, 0.52].
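The urn procedure can be sketched in R as follows. The number of bootstrap replicates `B`, the seed and the use of percentile limits are assumptions made for this illustration, so the limits will vary slightly between runs but should land close to the interval quoted above.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
x <- c(rep(1, 42), rep(0, 58))     # the observed sample: 1 = allergic, 0 = not allergic
B <- 10000                         # number of bootstrap replicates (assumed)

# Resample 100 observations with replacement and record the proportion allergic
boot_p <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap distribution
ci <- quantile(boot_p, probs = c(0.025, 0.975))
ci
```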

The bootstrap is very useful if we do not know the distribution of the sampled property. But in our example we actually do.

Confidence interval

A confidence interval is a type of interval estimate associated with a confidence level.

An interval that with probability \(1 - \alpha\) covers the population parameter \(\theta\) is called a confidence interval for \(\theta\) with confidence level \(1 - \alpha\).

Sampling distribution of mean

\[\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]
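A quick simulation can make this concrete. The sketch below reads the second argument of \(N(\cdot,\cdot)\) as a standard deviation (as R's `rnorm` does) and uses hypothetical values for `mu`, `sigma` and `n`.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
mu <- 27; sigma <- 3; n <- 25      # hypothetical population mean, sd and sample size

# Draw 10000 samples of size n and compute the mean of each
xbar <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbar)          # close to mu
sd(xbar)            # close to sigma / sqrt(n)
sigma / sqrt(n)     # the theoretical standard error of the mean
```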

Confidence interval of mean

If \(\sigma\) is known

\[Z = \frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1)\]

Standard normal distribution

\[P\left(-z_{\alpha/2} < Z <z_{\alpha/2}\right) = 1-\alpha\]

\(z_{\alpha/2}\) is the value such that \(P(Z \geq z_{\alpha/2}) = \frac{\alpha}{2} \iff P(Z \leq z_{\alpha/2}) = 1 - \frac{\alpha}{2}\).

For 95% confidence, \(\alpha = 0.05\) and \(z_{\alpha/2} = 1.96\). For 90% and 99% confidence, \(z_{0.05} = 1.64\) and \(z_{0.005}=2.58\), respectively.
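These values need not be memorised; they are quantiles of the standard normal distribution and can be computed with `qnorm`. A minimal sketch (the variable names are only illustrative):

```r
alpha <- c(0.10, 0.05, 0.01)    # 90%, 95% and 99% confidence
z <- qnorm(1 - alpha / 2)       # z_{alpha/2}: the 1 - alpha/2 quantile of N(0, 1)
round(z, 2)
```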

Confidence interval of mean

If \(\sigma\) is known

\[Z = \frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1)\]

From the standard normal distribution we know that

\[P(-z_{\alpha/2}<Z<z_{\alpha/2}) = 1-\alpha\]

\[P(-z_{\alpha/2}<\frac{\bar X-\mu}{SEM}<z_{\alpha/2}) = 1-\alpha\]

\[P(\mu-z_{\alpha/2}SEM<\bar X<\mu+z_{\alpha/2}SEM) = 1-\alpha\]

\[P(\bar X-z_{\alpha/2}SEM<\mu<\bar X+z_{\alpha/2}SEM) = 1-\alpha\]

The confidence interval with confidence level \(1-\alpha\) is

\[[\bar x_{obs} - z_{\alpha/2}SEM, \bar x_{obs} + z_{\alpha/2}SEM]\]

or

\[\mu = \bar x_{obs} \pm z_{\alpha/2}SEM\]

where \(SEM = \frac{\sigma}{\sqrt{n}}\).
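As a sketch of how this formula could be applied, the helper `z_ci` below is a hypothetical function written for this illustration (it is not part of base R), and the data and the known \(\sigma\) are made up.

```r
# Hypothetical helper: confidence interval for the mean when sigma is known
z_ci <- function(x, sigma, conf = 0.95) {
  alpha <- 1 - conf
  sem <- sigma / sqrt(length(x))    # standard error of the mean
  z <- qnorm(1 - alpha / 2)         # z_{alpha/2}
  mean(x) + c(-1, 1) * z * sem      # lower and upper limit
}

x <- c(4.8, 5.1, 5.6, 4.9, 5.3)     # made-up observations
z_ci(x, sigma = 0.5)                # sigma = 0.5 is assumed known here
```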

Confidence interval of mean

The mean of a sample of \(n\) independent and identically normally distributed observations \(X_i\) is normally distributed:

\[\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]

What if \(\sigma\) is unknown and \(n\) is small?

Use the statistic \(T=\frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} \sim t(n-1)\), where \(s\) is the sample standard deviation. \(T\) is t-distributed with \(n-1\) degrees of freedom.

It follows that, with \(t\) denoting the \(1-\alpha/2\) quantile of the \(t(n-1)\) distribution,

\[ \begin{aligned} P\left(-t < \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} < t\right) = 1 - \alpha \iff \\ P\left(\bar X - t \frac{s}{\sqrt{n}} < \mu < \bar X + t \frac{s}{\sqrt{n}}\right) = 1 - \alpha \end{aligned} \]

The confidence interval:

\[[\bar x_{obs} - t \frac{s}{\sqrt{n}}, \bar x_{obs} + t \frac{s}{\sqrt{n}}]\]

or

\[\mu = \bar x_{obs} \pm t \frac{s}{\sqrt{n}}\]

For a 95% confidence interval and \(n=5\), \(t=\) 2.7764.

The \(t\) values for different values of \(\alpha\) and degrees of freedom are tabulated and can be computed in R using the function qt.

n <- 5
alpha <- 0.05
## t value: the 1 - alpha/2 quantile of the t-distribution with n - 1 degrees of freedom
qt(1 - alpha/2, df = n - 1)
## [1] 2.776

Example

You study the BMI of male diabetic patients. In a sample of size 6 you observe the values \(27, 25, 31, 29, 30, 22\). Assume that the BMI is normally distributed and calculate a 95% confidence interval for the mean BMI in male diabetic patients.
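One way to work through this example in R is sketched below, following the \(t\)-based formula for unknown \(\sigma\). The variable names are only illustrative, and `t.test` is used as a cross-check.

```r
bmi <- c(27, 25, 31, 29, 30, 22)          # the observed sample
n <- length(bmi)
alpha <- 0.05

sem <- sd(bmi) / sqrt(n)                  # estimated standard error, s / sqrt(n)
t_crit <- qt(1 - alpha / 2, df = n - 1)   # t value with n - 1 = 5 degrees of freedom

ci <- mean(bmi) + c(-1, 1) * t_crit * sem
ci

# Cross-check: t.test reports the same interval
t.test(bmi, conf.level = 0.95)$conf.int
```

With these data the sample mean is about 27.3 and the interval runs from roughly 23.8 to 30.9.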

Confidence interval of proportions

Remember that we can use the central limit theorem to show that

\[P \sim N\left(\pi, SE\right) \iff P \sim N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)\]

It follows that

\[Z = \frac{P - \pi}{SE} \sim N(0,1)\]

Based on what we know of the standard normal distribution, we can compute an interval around the population property \(\pi\) such that the probability that a sample property \(p\) falls within this interval is \(1-\alpha\).

Confidence interval of proportion

\[\begin{aligned} P\left(-z_{\alpha/2} < Z <z_{\alpha/2}\right) &= 1-\alpha\\ P\left(-z_{\alpha/2} < \frac{P - \pi}{SE} < z_{\alpha/2}\right) &= 1 - \alpha \end{aligned}\]

We can rewrite this to

\[P\left(\pi-z_{\alpha/2} SE < P < \pi + z_{\alpha/2} SE\right) = 1-\alpha\]

In words, a sample fraction \(p\) will fall within \(\pi \pm z_{\alpha/2} SE\) with probability \(1- \alpha\).

The equation can also be rewritten to

\[P\left(P-z_{\alpha/2} SE < \pi < P + z_{\alpha/2} SE\right) = 1 - \alpha\]

Confidence interval of proportion

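The observed confidence interval is what we get when we replace the random variable \(P\) with our observed fraction \(p\). Since \(SE\) depends on the unknown \(\pi\), it is estimated by plugging in \(p\):

\[p-z_{\alpha/2} SE < \pi < p + z_{\alpha/2} SE\]

\[\pi = p \pm z_{\alpha/2} SE = p \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}\]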

Confidence interval of proportion

The 95% confidence interval is

\[\pi = p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}\]

Confidence interval of proportion

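A 95% confidence interval has a 95% chance of covering the true value: if we repeat the sampling many times and compute an interval from each sample, about 95% of the intervals will cover \(\pi\).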
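The coverage interpretation can be checked with a simulation. In this sketch the true proportion `pi_true`, the sample size and the number of repetitions are hypothetical. Because the interval rests on a normal approximation, the observed coverage is expected to be close to, but not exactly, 95%.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
pi_true <- 0.42                   # hypothetical true population proportion
n <- 100                          # sample size
reps <- 10000                     # number of repeated samples

p <- rbinom(reps, size = n, prob = pi_true) / n   # one sample proportion per repetition
se <- sqrt(p * (1 - p) / n)                       # estimated standard error per sample

# Does each 95% interval cover the true value?
covered <- (p - 1.96 * se < pi_true) & (pi_true < p + 1.96 * se)
mean(covered)                     # fraction of intervals that cover pi_true
```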

Confidence interval of proportion

Back to our example of the proportion of people allergic to pollen in Uppsala. Here \(n=100\), \(p=0.42\) and \(SE=\sqrt{\frac{p(1-p)}{n}} = 0.0494\).

Hence, the 95% confidence interval is

\[\pi = 0.42 \pm 1.96 \cdot 0.0494 = 0.42 \pm 0.097\]

or

\[(0.42-0.097, 0.42+0.097) = (0.32, 0.52)\]
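The same calculation can be reproduced in R. This is a minimal sketch with illustrative variable names, using `qnorm(0.975)` in place of the rounded value 1.96.

```r
n <- 100                        # sample size
x <- 42                         # number of allergic persons in the sample
p <- x / n                      # observed proportion

se <- sqrt(p * (1 - p) / n)     # estimated standard error
z <- qnorm(0.975)               # z value for 95% confidence
ci <- p + c(-1, 1) * z * se     # lower and upper limit

round(se, 4)                    # 0.0494
round(ci, 2)                    # 0.32 0.52
```

The result agrees with the 95% bootstrap interval found earlier for the same data.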